Data Science Methods and Quantitative Analysis - Home Credit Default Risk [90 Marks]
Deadline: February 11 at 21:00
Academic Integrity
This project is individual: It is to be completed on your own. Do not share your code with others, or post any parts of your work online. You can only submit code that you have written yourself. If you use any online resource for developing parts of your code, acknowledge the source in a comment in your code. Students suspected of plagiarism on a project will be referred to the university for formal discipline according to the Code of Behaviour on Academic Matters.
Please fill out the following:
Load the six csv files as pandas dataframes using a string 'path' for the location of files on your system (to be then updated by the marker for evaluation)
# Import necessary packages
import pandas as pd
import numpy as np
import math
from varclushi import VarClusHi
from sklearn import feature_selection
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold, train_test_split, RandomizedSearchCV, KFold, GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score, roc_auc_score, plot_confusion_matrix, plot_roc_curve, f1_score, precision_score, recall_score, auc, roc_curve, confusion_matrix, classification_report
import os
import gc
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from xgboost import XGBClassifier, plot_importance
import sys
!pwd
### YOUR CODE HERE ###
#e.g. application_train
path = 'data/home-credit-default-risk/'  # update this string to the data location on your system
application_train = pd.read_csv(path + 'application_train.csv')
bureau = pd.read_csv(path + 'bureau.csv')
bureau_balance = pd.read_csv(path + 'bureau_balance.csv')
POS_CASH_balance = pd.read_csv(path + 'POS_CASH_balance.csv')
credit_card_balance = pd.read_csv(path + 'credit_card_balance.csv')
installments_payments = pd.read_csv(path + 'installments_payments.csv')
print('shape of application_train',application_train.shape)
print('shape of bureau',bureau.shape)
print('shape of bureau_balance',bureau_balance.shape)
print('shape of POS_CASH_balance',POS_CASH_balance.shape)
print('shape of credit_card_balance',credit_card_balance.shape)
print('shape of installments_payments',installments_payments.shape)
To conduct data cleaning, the first step is to create and select relevant features. This includes the following sub-steps:
See details in the project handout.
bureau_balance
# Calculate the proportion of 0 in the variable STATUS across the whole dataset
proportion_of_0= (bureau_balance['STATUS'].values == '0').sum()/(len(bureau_balance['STATUS']))
print('proportion_of_0 in the total dataset:',proportion_of_0)
def status_to_numeric_0(x):
    if x == '0':
        return 0
    else:
        return 1
# Calculate the proportion of 0 in STATUS per SK_ID_BUREAU.
## 1. Some IDs have no 0 at all, so a value_counts-based approach would yield NaN;
##    mapping STATUS to a numeric flag first avoids this.
## 2. sum / count gives the non-zero proportion; the zero proportion is then 1 - (non-zero proportion).
bureau_balance['0_or_not'] = bureau_balance['STATUS'].apply(status_to_numeric_0)
a=bureau_balance.groupby('SK_ID_BUREAU')['STATUS'].count().rename('status_total').reset_index()
b=bureau_balance.groupby('SK_ID_BUREAU')['0_or_not'].sum().rename('status_sum').reset_index()
c=pd.merge(a,b,on='SK_ID_BUREAU')
c['0_prop']=1-c['status_sum']/c['status_total']
bureau_balance=pd.merge(c,bureau_balance,on='SK_ID_BUREAU')
bureau_balance
# print(bureau_balance.loc[bureau_balance['SK_ID_BUREAU']==5715448]) ## sanity check on the calculation
# Drop MONTHS_BALANCE as it's irrelevant to the clients' behaviour,
# along with the intermediate variables created above
# (note: drop from the result of the first drop, or the first drop is lost)
bureau_balance_d = bureau_balance.drop(['MONTHS_BALANCE', 'status_sum', 'status_total', '0_or_not'], axis=1)
bureau_balance_d
# merge bureau_balance to bureau using the key SK_ID_BUREAU
bureau_combine=bureau.merge(bureau_balance_d,left_on='SK_ID_BUREAU',right_on='SK_ID_BUREAU')
bureau_combine.columns,bureau.columns
bureau_combine
bureau_combine.columns
# For each SK_ID_CURR, calculate the proportion of Closed and Active in CREDIT_ACTIVE.
# First, the overall proportions across the whole dataset; the per-client version is computed below.
proportion_of_closed = (bureau_combine['CREDIT_ACTIVE'].values == 'Closed').sum() / len(bureau_combine['CREDIT_ACTIVE'])
proportion_of_active = (bureau_combine['CREDIT_ACTIVE'].values == 'Active').sum() / len(bureau_combine['CREDIT_ACTIVE'])
proportion_of_closed, proportion_of_active
bureau_combine.CREDIT_ACTIVE.value_counts()
def status_to_numeric_actclos(x):
    if x == 'Closed' or x == 'Active':
        return 1
    else:
        return 0
bureau_combine['clos_active_or_not'] = bureau_combine['CREDIT_ACTIVE'].apply(status_to_numeric_actclos)
bureau_combine
a=bureau_combine.groupby('SK_ID_CURR')['CREDIT_ACTIVE'].count().rename('status_total').reset_index()
b=bureau_combine.groupby('SK_ID_CURR')['clos_active_or_not'].sum().rename('status_sum').reset_index()
c=pd.merge(a,b,on='SK_ID_CURR')
c['clos_active_prop']=c['status_sum']/c['status_total']
bureau_combine=pd.merge(c,bureau_combine,on='SK_ID_CURR')
bureau_combine
# print(bureau_combine.loc[bureau_combine['SK_ID_CURR']==456255].CREDIT_ACTIVE) ## sanity check on the calculation
# print(bureau_combine.loc[bureau_combine['SK_ID_CURR']==456255].CREDIT_ACTIVE.value_counts()) ## sanity check on the calculation
bureau_combine = bureau_combine.drop(['status_sum', 'status_total', 'clos_active_or_not'], axis=1)  ## remove the intermediate variables
# Remove CREDIT_CURRENCY as it’s constant.
bureau_combine = bureau_combine.drop('CREDIT_CURRENCY', axis=1)
bureau_combine.columns
# Remove DAYS_CREDIT, CREDIT_DAY_OVERDUE, DAYS_CREDIT_ENDDATE, DAYS_ENDDATE_FACT and
# DAYS_CREDIT_UPDATE as these are irrelevant variables.
bureau_combine = bureau_combine.drop(['DAYS_CREDIT','CREDIT_DAY_OVERDUE','DAYS_CREDIT_ENDDATE','DAYS_ENDDATE_FACT','DAYS_CREDIT_UPDATE'], axis=1)
bureau_combine.columns
bureau_combine
# SK_ID_BUREAU is no longer needed as a merge key; it can optionally be dropped:
# bureau_combine=bureau_combine.drop(['SK_ID_BUREAU'], axis = 1)
bureau_combine.shape
bureau_combine.CREDIT_TYPE.value_counts()
def credit_to_numeric(x):
    if x == 'Consumer credit' or x == 'Credit card':
        return 1
    else:
        return 0
bureau_combine['cons_cred_card_or_not'] = bureau_combine['CREDIT_TYPE'].apply(credit_to_numeric)
bureau_combine
# For each SK_ID_CURR, calculate the proportion of Consumer credit and Credit card of CREDIT_TYPE.
a=bureau_combine.groupby('SK_ID_CURR')['CREDIT_TYPE'].count().rename('status_total').reset_index()
b=bureau_combine.groupby('SK_ID_CURR')['cons_cred_card_or_not'].sum().rename('status_sum').reset_index()
c=pd.merge(a,b,on='SK_ID_CURR')
c['cons_cred_card_prop']=c['status_sum']/c['status_total']
bureau_combine=pd.merge(c,bureau_combine,on='SK_ID_CURR')
bureau_combine = bureau_combine.drop(['status_sum', 'status_total', 'cons_cred_card_or_not'], axis=1)  ## remove the intermediate variables
print(bureau_combine.loc[bureau_combine['SK_ID_CURR']==456255].cons_cred_card_prop.value_counts())  ## sanity check on the calculation
print(bureau_combine.loc[bureau_combine['SK_ID_CURR']==456255].CREDIT_TYPE.value_counts())  ## sanity check on the calculation
# bureau_combine
# calculate the average for the numerical variable per SK_ID_CURR and use the calculated average
# to present the corresponding feature value for that SK_ID_CURR For example, the variable that
# you created based on the proportion of 0 in the variable STATUS is to be averaged for each unique
# SK_ID_CURR before being merged with application_train.
bureau_num = bureau_combine.groupby(by=['SK_ID_CURR']).mean().reset_index() # group the numeric features by SK_ID_CURR
print(bureau_num.shape, "- shape of numeric bureau features (incl index)")
bureau_cat = pd.get_dummies(bureau_combine.select_dtypes('object'))  # get_dummies drops the SK_ID_CURR column ...
bureau_cat['SK_ID_CURR'] = bureau_combine['SK_ID_CURR']  # ... so we have to add it back
bureau_cat = bureau_cat.groupby(by=['SK_ID_CURR']).mean().reset_index()  # mean gives the share of rows per category for each client
print(bureau_cat.shape, "- shape of categorical bureau features (incl index)")
bureau_cat
bureau_num
bureau_combine
bureau_combine_d = bureau_combine.drop(['CREDIT_ACTIVE', 'CREDIT_TYPE', 'STATUS'], axis=1)  ## remove the raw categorical columns now encoded as proportions
bureau_combine_d = bureau_combine_d.groupby(by=['SK_ID_CURR']).mean().reset_index()  # average the numeric features per SK_ID_CURR
bureau_combine_d  # remaining NaNs are handled later by the imputation step
len(bureau_combine_d.SK_ID_CURR.unique())
### YOUR CODE HERE ###
installments_payments
# Combine AMT_INSTALMENT and AMT_PAYMENT into one variable indicating whether the client has
# paid out the installment on time. One new dummy variable should be created based on the following criterion: if AMT_PAYMENT ≥ AMT_INSTALMENT, then the dummy equals 1, otherwise it equals 0.
paydiff= installments_payments['AMT_PAYMENT']-installments_payments['AMT_INSTALMENT']
installments_payments['paid_on_time'] = paydiff.apply(lambda x: 1 if x>=0 else 0)
# plt.plot(a)
installments_payments['paid_on_time']
print('proportion paid on time', installments_payments['paid_on_time'].sum()/len(installments_payments['paid_on_time']))
print('maximum paydiff',paydiff.max(skipna = True))
## Drop AMT_INSTALMENT and AMT_PAYMENT because the dummy variable now carries their information
installments_payments_new = installments_payments.drop(['AMT_PAYMENT', 'AMT_INSTALMENT'], axis=1)
# Then, calculate the maximum of that dummy variable for each SK_ID_CURR.
maxima = installments_payments_new.groupby('SK_ID_CURR')['paid_on_time'].max()
installments_payments_new['max_paid_on_time']=installments_payments_new['SK_ID_CURR'].map(maxima)
installments_payments_new
installments_payments_u=installments_payments_new.groupby(by=['SK_ID_CURR']).mean().reset_index()
len(installments_payments_u.SK_ID_CURR.unique())  ## after aggregation, each SK_ID_CURR appears once
len(installments_payments_new.SK_ID_CURR.unique())  ## before aggregation, SK_ID_CURR has duplicates
credit_card_balance
# For the variables AMT_DRAWINGS_ATM_CURRENT, AMT_DRAWINGS_CURRENT, AMT_DRAWINGS_OTHER_CURRENT,
# AMT_DRAWINGS_POS_CURRENT, AMT_RECIVABLE, and AMT_TOTAL_RECEIVABLE replace all NA with 0
cols= ['AMT_DRAWINGS_ATM_CURRENT', 'AMT_DRAWINGS_CURRENT','AMT_DRAWINGS_OTHER_CURRENT','AMT_DRAWINGS_POS_CURRENT','AMT_RECIVABLE','AMT_TOTAL_RECEIVABLE']
credit_card_balance[cols]=credit_card_balance[cols].fillna(0)
credit_card_balance
len(credit_card_balance.SK_ID_CURR.unique())
# then compute average for each SK_ID_CURR.
ccb_num = credit_card_balance.groupby(by=['SK_ID_CURR']).mean().reset_index() # group the numeric features by SK_ID_CURR
ccb_num
credit_card_balance['NAME_CONTRACT_STATUS'].value_counts()
### One-hot encode the categorical NAME_CONTRACT_STATUS, then aggregate per client
ccb_cat = pd.get_dummies(credit_card_balance.select_dtypes('object'))  # get_dummies drops the SK_ID_CURR column ...
ccb_cat
ccb_cat['SK_ID_CURR'] = credit_card_balance['SK_ID_CURR']  # ... so we have to add it back
ccb_cat = ccb_cat.groupby(by=['SK_ID_CURR']).mean().reset_index()  # mean gives the share of months in each status; sum would give counts
ccb_cat
credit_card_balance_u=pd.merge(ccb_num,ccb_cat,on='SK_ID_CURR')
len(credit_card_balance_u)==len(credit_card_balance_u.SK_ID_CURR.unique())
POS_CASH_balance
# Pick variable SK_DPD, replace NA with 0 if any, and make it into a dummy variable based on the
# following criterion: If the original value is 0 then keep it as 0, else replace original value with 1.
POS_CASH_balance['SK_DPD']=POS_CASH_balance['SK_DPD'].fillna(0)
POS_CASH_balance['SK_DPD'] = POS_CASH_balance['SK_DPD'].apply(lambda x: 1 if x>0 else 0)
POS_CASH_balance
# Then, take the maximum of the newly created dummy variable for each SK_ID_CURR.
maxima = POS_CASH_balance.groupby('SK_ID_CURR')['SK_DPD'].max()
POS_CASH_balance['max_SK_DPD']=POS_CASH_balance['SK_ID_CURR'].map(maxima)
POS_CASH_balance
# Replace NA with 0 for CNT_INSTALMENT_FUTURE, and take its average for each SK_ID_CURR.
POS_CASH_balance['CNT_INSTALMENT_FUTURE']=POS_CASH_balance['CNT_INSTALMENT_FUTURE'].fillna(0)
POS_CASH_balance
POS_CASH_balance_u=POS_CASH_balance.groupby(by=['SK_ID_CURR']).mean().reset_index()
len(POS_CASH_balance_u)==len(POS_CASH_balance.SK_ID_CURR.unique())
Report the shape of the combined dataset.
application_train
### YOUR ANSWER HERE ###
## combine these dataset : bureau_combine_d,installments_payments_u,POS_CASH_balance_u,credit_card_balance_u,
application_train_m1=pd.merge(application_train,bureau_combine_d,on='SK_ID_CURR')
application_train_m2=pd.merge(application_train_m1,installments_payments_u,on='SK_ID_CURR')
application_train_m3=pd.merge(application_train_m2,POS_CASH_balance_u,on='SK_ID_CURR')
application_train_m=pd.merge(application_train_m3,credit_card_balance_u,on='SK_ID_CURR')
print('the shape of the combined dataset is:',application_train_m.shape)
application_train_m.to_csv('data/home-credit-default-risk/application_m.csv')
application_train_m
Provide a data dictionary (i.e. an organized list of variables you have added to the main dataset and their descriptions) before performing any statistical modelling.
### YOUR ANSWER HERE ###
print('columns of application train data',application_train_m.columns)
value_description = {
    "0_prop": "the proportion of 0 in the variable STATUS; an indicator of good standing of the client's bureau loans",
    "clos_active_prop": "the proportion of Closed and Active entries in CREDIT_ACTIVE",
    "cons_cred_card_prop": "the proportion of Consumer credit and Credit card entries in CREDIT_TYPE",
    "paid_on_time": "dummy variable: 1 if AMT_PAYMENT >= AMT_INSTALMENT (installment paid in full), otherwise 0",
    "max_paid_on_time": "the maximum of the paid_on_time dummy for each SK_ID_CURR",
    "SK_DPD": "dummy variable: 0 if the original days-past-due value is 0, otherwise 1",
    "max_SK_DPD": "the maximum of the SK_DPD dummy for each SK_ID_CURR"
}
for keys, values in value_description.items():
    print('variable:', keys)
    print('description:', values)
Determine the response variable in your notebook and explain in plain English what it represents. Provide a Seaborn plot for the response variable and one arbitrary pair of the features that you have created in the previous parts. Use appropriate figure size, title/caption, legend, and axis titles.
### YOUR CODE HERE ###
#Look into the function and corresponding parameters, sns.relplot;
# Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases)
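Since the prompt asks for a Seaborn plot of the response against a pair of engineered features, here is a minimal sketch using `sns.relplot` on synthetic stand-in data (the feature names `0_prop` and `clos_active_prop` come from the earlier feature engineering; the random values are placeholders for the merged dataset):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for application_train_m: TARGET plus two engineered features
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "TARGET": rng.integers(0, 2, size=500),
    "0_prop": rng.random(500),            # proportion of '0' STATUS per client
    "clos_active_prop": rng.random(500),  # proportion of Closed/Active credits
})

# Scatter the two engineered features, coloured by the response variable
g = sns.relplot(data=demo, x="0_prop", y="clos_active_prop",
                hue="TARGET", height=5, aspect=1.4)
g.set_axis_labels("Proportion of '0' STATUS", "Proportion of Closed/Active credits")
g.ax.set_title("Engineered bureau features by TARGET")
plt.savefig("target_features.png", bbox_inches="tight")
```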
import cufflinks as cf
cf.go_offline()
cf.set_config_file(theme='polar')
application_train['TARGET'].value_counts()
target_val = application_train['TARGET'].value_counts()
target_df = pd.DataFrame({'labels': target_val.index,
'values': target_val.values
})
target_df.iplot(kind='pie',labels='labels',values='values', title='Types of target', hole = 0.6)
Explanation: TARGET indicates whether the loan was repaid (0) or not (1). The data is imbalanced (91.9% repaid (0) vs. 8.07% not repaid (1)), so this needs to be handled.
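Given that imbalance, the models below pass `class_weight='balanced'` to reweight samples inversely to class frequency. A small sketch of how those weights are derived, using made-up counts that mirror the split above:

```python
import numpy as np

# Hypothetical labels mirroring the roughly 92%/8% split described above
y = np.array([0] * 920 + [1] * 80)

# 'balanced' weighting: n_samples / (n_classes * count(class))
classes, counts = np.unique(y, return_counts=True)
weights = len(y) / (len(classes) * counts)
weight_map = dict(zip(classes, weights))
print(weight_map)  # the minority (default) class receives the larger weight
```

This matches the convention sklearn uses in `compute_class_weight('balanced')`.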
contract_val = application_train['NAME_CONTRACT_TYPE'].value_counts()
contract_df = pd.DataFrame({'labels': contract_val.index,
'values': contract_val.values
})
contract_df.iplot(kind='pie',labels='labels',values='values', title='Types of Loan', hole = 0.6)
Explanation: far more applicants take cash loans than revolving loans.
application_train[application_train['AMT_INCOME_TOTAL'] < 2000000]['AMT_INCOME_TOTAL'].iplot(kind='histogram', bins=100,
xTitle = 'Total Income', yTitle ='Count of applicants',
title='Distribution of AMT_INCOME_TOTAL')
Explanation: 1. The distribution is right-skewed with extreme values, so a log transformation can be applied.
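A log transform compresses the right tail; a self-contained sketch on synthetic, lognormally distributed incomes (standing in for AMT_INCOME_TOTAL; `np.log1p` is used so zero values stay finite):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed incomes (lognormal), standing in for AMT_INCOME_TOTAL
income = rng.lognormal(mean=11, sigma=0.6, size=10_000)

log_income = np.log1p(income)

# Sample skewness: third standardized moment
def skew(x):
    x = np.asarray(x)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

# Skewness drops sharply after the transform
print(skew(income), skew(log_income))
```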
Fit a logistic regression model using the selected explanatory variables and the training data.
# Split the combined dataset into training (70%), validation (15%), and testing (15%) sets
y = application_train_m.pop('TARGET').values
X_train, X_temp, y_train, y_temp = train_test_split(application_train_m.drop(['SK_ID_CURR'],axis=1), y, stratify = y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, stratify = y_temp, test_size=0.5, random_state=42)
print('Shape of X_train:',X_train.shape)
print('Shape of X_val:',X_val.shape)
print('Shape of X_test:',X_test.shape)
# Separation of columns into numeric and categorical columns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
types = np.array([dt for dt in X_train.dtypes])
all_columns = X_train.columns.values
is_num = types != 'object'
num_cols = all_columns[is_num]
cat_cols = all_columns[~is_num]
# Featurization of numeric data
imputer_num = SimpleImputer(strategy='median')
X_train_num = imputer_num.fit_transform(X_train[num_cols])
X_val_num = imputer_num.transform(X_val[num_cols])
X_test_num = imputer_num.transform(X_test[num_cols])
scaler_num = StandardScaler()
X_train_num1 = scaler_num.fit_transform(X_train_num)
X_val_num1 = scaler_num.transform(X_val_num)
X_test_num1 = scaler_num.transform(X_test_num)
X_train_num_final = pd.DataFrame(X_train_num1, columns=num_cols)
X_val_num_final = pd.DataFrame(X_val_num1, columns=num_cols)
X_test_num_final = pd.DataFrame(X_test_num1, columns=num_cols)
# Featurization of categorical data
imputer_cat = SimpleImputer(strategy='constant', fill_value='MISSING')
X_train_cat = imputer_cat.fit_transform(X_train[cat_cols])
X_val_cat = imputer_cat.transform(X_val[cat_cols])
X_test_cat = imputer_cat.transform(X_test[cat_cols])
X_train_cat1= pd.DataFrame(X_train_cat, columns=cat_cols)
X_val_cat1= pd.DataFrame(X_val_cat, columns=cat_cols)
X_test_cat1= pd.DataFrame(X_test_cat, columns=cat_cols)
ohe = OneHotEncoder(sparse=False,handle_unknown='ignore')
X_train_cat2 = ohe.fit_transform(X_train_cat1)
X_val_cat2 = ohe.transform(X_val_cat1)
X_test_cat2 = ohe.transform(X_test_cat1)
cat_cols_ohe = list(ohe.get_feature_names(input_features=cat_cols))
X_train_cat_final = pd.DataFrame(X_train_cat2, columns = cat_cols_ohe)
X_val_cat_final = pd.DataFrame(X_val_cat2, columns = cat_cols_ohe)
X_test_cat_final = pd.DataFrame(X_test_cat2, columns = cat_cols_ohe)
# Final complete data
X_train_final = pd.concat([X_train_num_final,X_train_cat_final], axis = 1)
X_val_final = pd.concat([X_val_num_final,X_val_cat_final], axis = 1)
X_test_final = pd.concat([X_test_num_final,X_test_cat_final], axis = 1)
print(X_train_final.shape)
print(X_val_final.shape)
print(X_test_final.shape)
# Saving the numpy arrays into text files for future use
np.savetxt('y.txt', y)
np.savetxt('y_train.txt', y_train)
np.savetxt('y_val.txt', y_val)
np.savetxt('y_test.txt', y_test)
application_test = pd.read_csv('data/home-credit-default-risk/application_test.csv')
print('testset shape',application_test.shape)
print('train set shape',application_train_m.shape)
def cv_plot(alpha, cv_auc):
    fig, ax = plt.subplots()
    ax.plot(np.log10(alpha), cv_auc, c='g')
    for i, txt in enumerate(np.round(cv_auc, 3)):
        ax.annotate(f"({alpha[i]}, {txt})", (np.log10(alpha[i]), cv_auc[i]))
    plt.grid()
    plt.xticks(np.log10(alpha))
    plt.title("Validation AUC for each alpha")
    plt.xlabel("log10(alpha)")
    plt.ylabel("AUC")
    plt.show()
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
## parameter tuning
alpha = np.logspace(-4,4,9)
cv_auc_score = []
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l1', class_weight='balanced', loss='log', random_state=28)
    clf.fit(X_train_final, y_train)
    sig_clf = CalibratedClassifierCV(clf, method='sigmoid')
    sig_clf.fit(X_train_final, y_train)
    y_pred_prob = sig_clf.predict_proba(X_val_final)[:,1]
    cv_auc_score.append(roc_auc_score(y_val, y_pred_prob))
    print('For alpha {0}, validation AUC score {1}'.format(i, roc_auc_score(y_val, y_pred_prob)))
cv_plot(alpha, cv_auc_score)
print('The optimal alpha value is:', alpha[np.argmax(cv_auc_score)])
# best_alpha = alpha[np.argmax(cv_auc_score)]
best_alpha = 0.01  # chosen from the validation curve above
logreg = SGDClassifier(alpha = best_alpha, class_weight = 'balanced', penalty = 'l1', loss='log', random_state = 28)
logreg.fit(X_train_final, y_train)
logreg_sig_clf = CalibratedClassifierCV(logreg, method='sigmoid')
logreg_sig_clf.fit(X_train_final, y_train)
y_pred_prob = logreg_sig_clf.predict_proba(X_train_final)[:,1]
print('For best alpha {0}, The Train AUC score is {1}'.format(best_alpha, roc_auc_score(y_train,y_pred_prob) ))
y_pred_prob = logreg_sig_clf.predict_proba(X_val_final)[:,1]
print('For best alpha {0}, The Validation AUC score is {1}'.format(best_alpha, roc_auc_score(y_val,y_pred_prob) ))
y_pred_prob = logreg_sig_clf.predict_proba(X_test_final)[:,1]
print('For best alpha {0}, The Test AUC score is {1}'.format(best_alpha, roc_auc_score(y_test,y_pred_prob) ))
y_pred = logreg.predict(X_test_final)
print('The test AUC score is :', roc_auc_score(y_test,y_pred_prob))
print('The percentage of misclassified points {:05.2f}% :'.format((1-accuracy_score(y_test, y_pred))*100))
import statsmodels.discrete.discrete_model as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.regressionplots import *
correlations=X_train_final.corr()
sns.heatmap(correlations)
def corr_df(x, corr_val):
    '''
    Obj: Drops features that are strongly correlated to other features.
         This lowers model complexity, and aids in generalizing the model.
    Inputs:
        x: features DataFrame
        corr_val: correlation threshold above which a column is dropped (e.g. 0.8)
    Output: DataFrame that only includes weakly correlated features
    '''
    # Create the correlation matrix
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []
    # Iterate through the correlation matrix to find correlated column pairs
    for i in iters:
        for j in range(i):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = item.values
            if abs(val) >= corr_val:
                # Print the correlated feature pair and the correlation value
                print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(i)
    drops = sorted(set(drop_cols))[::-1]
    # Drop the correlated columns, accumulating the drops
    # (re-dropping from x each time would keep only the last drop)
    df = x.copy()
    for i in drops:
        col = x.iloc[:, (i+1):(i+2)].columns.values
        df = df.drop(col, axis=1)
    return df
x_train_in=corr_df(X_train_final,0.9)
x_train_in
print('target shape',y_train.shape)
y_train
print('target shape',y_test.shape)
y_test
According to the explanatory-variable table and the correlations calculated above, we selected the explanatory variables and the training data.
# x_train_in = pd.DataFrame(X_train_final, columns=['AMT_INCOME_TOTAL','AMT_CREDIT_MAX_OVERDUE','clos_active_prop','cons_cred_card_prop','paid_on_time'])
x_train_in = pd.DataFrame(X_train_final, columns=['AMT_INCOME_TOTAL','AMT_CREDIT_MAX_OVERDUE','cons_cred_card_prop','paid_on_time'])
logit = sm.Logit(y_train, x_train_in.fillna(0)).fit()
logit.summary()
x_train_in2 = pd.DataFrame(X_train_final, columns=['AMT_INCOME_TOTAL','AMT_CREDIT_MAX_OVERDUE','cons_cred_card_prop','paid_on_time','FLOORSMIN_MEDI','EXT_SOURCE_#'])
logit = sm.Logit(y_train, x_train_in2.fillna(0)).fit()
x_train_in3= pd.DataFrame(X_train_final, columns=['AMT_INCOME_TOTAL','AMT_CREDIT_MAX_OVERDUE','cons_cred_card_prop','paid_on_time','EXT_SOURCE_3'])
logit = sm.Logit(y_train, x_train_in3.fillna(0)).fit()
x_test_in = pd.DataFrame(X_test_final, columns=['AMT_INCOME_TOTAL','AMT_CREDIT_MAX_OVERDUE','cons_cred_card_prop','paid_on_time'])
x_test_in
# Refit on the same four columns so the model matches the test design matrix
logit = sm.Logit(y_train, x_train_in.fillna(0)).fit()
len(logit.predict(x_test_in))
# Mark it as a 1 if the logit prediction is above 50%, and 0 otherwise
predict_label = pd.DataFrame(np.zeros(shape=(len(x_test_in), 1)), columns=['label'])
predict_label.iloc[logit.predict(x_test_in).values > 0.5] = 1
predict_label
Interpret the estimated coefficients of the logistic regression model. Explain their impact on the response variable in plain English.
### YOUR CODE HERE ###
# this explanation can be quite long
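One standard interpretation device is the odds ratio: exp(coefficient) gives the multiplicative change in the odds of default for a one-unit increase in that feature, holding the others fixed. A sketch with hypothetical coefficient values (in the notebook, the real values come from `logit.params`):

```python
import numpy as np
import pandas as pd

# Hypothetical fitted coefficients; in the notebook these come from logit.params
coefs = pd.Series({
    "AMT_INCOME_TOTAL": -0.15,
    "AMT_CREDIT_MAX_OVERDUE": 0.30,
    "cons_cred_card_prop": -0.40,
    "paid_on_time": -1.20,
})

# exp(beta) is the odds ratio for a one-unit increase in the feature
odds_ratios = np.exp(coefs)
for name, orr in odds_ratios.items():
    direction = "raises" if orr > 1 else "lowers"
    print(f"{name}: one-unit increase multiplies the odds of default by "
          f"{orr:.2f} ({direction} default risk)")
```

An odds ratio above 1 means the feature pushes the predicted odds of default up; below 1, down.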
Calculate the confusion matrix using the testing set.
### YOUR CODE HERE ###
#Look into the function and corresponding parameters; confusion_matrix
def plot_confusion_matrix(test_y, predicted_y):
    # Note: this local function shadows sklearn's plot_confusion_matrix imported above
    # Confusion matrix
    C = confusion_matrix(test_y, predicted_y)
    # Recall matrix (rows normalized by actual-class totals)
    A = (((C.T)/(C.sum(axis=1))).T)
    # Precision matrix (columns normalized by predicted-class totals)
    B = (C/C.sum(axis=0))
    plt.figure(figsize=(20,4))
    labels = ['Re-paid(0)','Not Re-paid(1)']
    cmap = sns.light_palette("purple")
    plt.subplot(1,3,1)
    sns.heatmap(C, annot=True, cmap=cmap, fmt="d", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title('Confusion matrix')
    plt.subplot(1,3,2)
    sns.heatmap(A, annot=True, cmap=cmap, xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title('Recall matrix')
    plt.subplot(1,3,3)
    sns.heatmap(B, annot=True, cmap=cmap, xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title('Precision matrix')
    plt.show()
plot_confusion_matrix(y_test, predict_label['label'])
plot_confusion_matrix(y_test, y_pred)
Explain a weakness or a strength of the model based on the confusion matrix.
### YOUR EXPLANATION HERE ###
Plot the ROC curve (receiver operating characteristic curve) showing the performance of the model, and calculate the AUC (area under the curve).
### YOUR CODE HERE ###
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc_score = roc_auc_score(y_test, y_pred_prob)  # avoid shadowing sklearn's auc function imported above
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, marker='.')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.title('ROC curve', fontsize=20)
plt.xlabel('FPR', fontsize=15)
plt.ylabel('TPR', fontsize=15)
plt.grid()
plt.legend(["AUC=%.3f" % auc_score])
plt.show()
### YOUR CODE HERE ###
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, predict_label['label'])
auc_score = roc_auc_score(y_test, predict_label['label'])  # avoid shadowing sklearn's auc function imported above
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, marker='.')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.title('ROC curve', fontsize=20)
plt.xlabel('FPR', fontsize=15)
plt.ylabel('TPR', fontsize=15)
plt.grid()
plt.legend(["AUC=%.3f" % auc_score])
plt.show()
Compute the following measures to assess the performance of the logistic regression model: precision, recall, F_1-score, accuracy, and total misclassification rate; using the testing set.
### YOUR CODE HERE ###
#Given the confusion matrix result, note that this can be computed manually using formula as well
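As the hint notes, all five measures follow from the confusion-matrix counts. A self-contained sketch with toy labels (in the notebook, `y_test` and `y_pred` would replace them), cross-checked against sklearn:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Toy labels standing in for y_test / y_pred
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat  = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# For binary labels [0, 1], ravel() returns counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
precision = tp / (tp + fp)                            # = precision_score(y_true, y_hat)
recall = tp / (tp + fn)                               # = recall_score(y_true, y_hat)
f1 = 2 * precision * recall / (precision + recall)    # = f1_score(y_true, y_hat)
accuracy = (tp + tn) / (tp + tn + fp + fn)            # = accuracy_score(y_true, y_hat)
misclass_rate = 1 - accuracy

print(precision, recall, f1, accuracy, misclass_rate)
```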
If you were to only choose two of the five performance measures for reporting, which ones do you choose? Justify your choice.
### YOUR EXPLANATION HERE ###
Select an alternative classification model that has at least one tunable hyper-parameter.
HINT !
You might want to try random forest or boosted tree etc.
Train your alternative model on the same training set.
Selection of features
import re
## to resolve an issue with column naming (LightGBM rejects special characters in feature names)
train_features, valid_features, train_y, valid_y = train_test_split(X_train_final, y_train, test_size = 0.15, random_state = 42)
train_features = train_features.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))
valid_features = valid_features.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))
X_train_final = X_train_final.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))
import pickle
import lightgbm as lgb
model_sk = lgb.LGBMClassifier(boosting_type='gbdt', max_depth=7, learning_rate=0.01, n_estimators= 2000,
class_weight='balanced', subsample=0.9, colsample_bytree= 0.8, n_jobs=-1)
model_sk.fit(train_features, train_y, early_stopping_rounds=100, eval_set = [(valid_features, valid_y)], eval_metric = 'auc', verbose = 200)
feature_imp = pd.DataFrame(sorted(zip(model_sk.feature_importances_, X_train_final.columns)), columns=['Value','Feature'])
features_df = feature_imp.sort_values(by="Value", ascending=False)
selected_features = list(features_df[features_df['Value']>=50]['Feature'])
# Saving the selected features into pickle file
with open('select_features.txt','wb') as fp:
    pickle.dump(selected_features, fp)
print('The no. of features selected:',len(selected_features))
# Feature importance Plot
import plotly.offline as py
import plotly.graph_objs as go
data1 = features_df.head(20)
data = [go.Bar(x =data1.sort_values(by='Value')['Value'] , y = data1.sort_values(by='Value')['Feature'], orientation = 'h',
marker = dict(
color = 'rgba(43, 13, 150, 0.6)',
line = dict(
color = 'rgba(43, 13, 150, 1.0)',
width = 1.5)
)) ]
layout = go.Layout(
autosize=False,
width=1300,
height=700,
title = "Top 20 important features",
xaxis=dict(
title='Importance value'
),
yaxis=dict(
automargin=True
),
bargap=0.4
)
fig = go.Figure(data = data, layout=layout)
fig.layout.template = 'seaborn'
py.iplot(fig)
data1
alpha = [200,500,1000,2000]
max_depth = [7, 10]
cv_auc_score = []
for i in alpha:
    for j in max_depth:
        clf = RandomForestClassifier(n_estimators=i, criterion='gini', max_depth=j, class_weight='balanced',
                                     random_state=42, n_jobs=-1)
        clf.fit(X_train_final, y_train)
        sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
        sig_clf.fit(X_train_final, y_train)
        y_pred_prob = sig_clf.predict_proba(X_val_final)[:,1]
        cv_auc_score.append(roc_auc_score(y_val, y_pred_prob))
        print('For n_estimators {0}, max_depth {1} validation AUC score {2}'.
              format(i, j, roc_auc_score(y_val, y_pred_prob)))
best_alpha = np.argmax(cv_auc_score)
print('The optimal values are: n_estimators {0}, max_depth {1} '.format(alpha[int(best_alpha/2)],
max_depth[int(best_alpha%2)]))
rf = RandomForestClassifier(n_estimators=alpha[int(best_alpha/2)], criterion='gini', max_depth=max_depth[int(best_alpha%2)],
class_weight='balanced', random_state=42, n_jobs=-1)
rf.fit(X_train_final, y_train)
rf_sig_clf = CalibratedClassifierCV(rf, method="sigmoid")
rf_sig_clf.fit(X_train_final, y_train)
y_pred_prob = rf_sig_clf.predict_proba(X_train_final)[:,1]
print('For best n_estimators {0} best max_depth {1}, The Train AUC score is {2}'.format(alpha[int(best_alpha/2)],
max_depth[int(best_alpha%2)],roc_auc_score(y_train,y_pred_prob)))
y_pred_prob = rf_sig_clf.predict_proba(X_val_final)[:,1]
print('For best n_estimators {0} best max_depth {1}, The Validation AUC score is {2}'.format(alpha[int(best_alpha/2)],
max_depth[int(best_alpha%2)],roc_auc_score(y_val,y_pred_prob)))
y_pred_prob = rf_sig_clf.predict_proba(X_test_final)[:,1]
print('For best n_estimators {0} best max_depth {1}, The Test AUC score is {2}'.format(alpha[int(best_alpha/2)],
max_depth[int(best_alpha%2)],roc_auc_score(y_test,y_pred_prob)))
y_pred = rf_sig_clf.predict(X_test_final)
print('The test AUC score is :', roc_auc_score(y_test,y_pred_prob))
print('The percentage of misclassified points {:05.2f}% :'.format((1-accuracy_score(y_test, y_pred))*100))
plot_confusion_matrix(y_test, y_pred)
Use a k-fold cross validation to tune the hyper-parameters of the alternative model based on accuracy.
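The tuning code below actually scores on a single held-out validation split. For a literal k-fold cross-validation scored on accuracy, one option is sklearn's `cross_val_score`; this sketch uses a synthetic dataset and a RandomForest stand-in, since re-running LightGBM inside five folds on the full data would be slow:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for X_train_final / y_train
X, y = make_classification(n_samples=600, n_features=10, weights=[0.9, 0.1],
                           random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
best_depth, best_acc = None, -np.inf
for depth in [3, 5, 7]:
    clf = RandomForestClassifier(n_estimators=100, max_depth=depth,
                                 class_weight='balanced', random_state=42)
    # Mean accuracy across the 5 folds for this hyper-parameter value
    acc = cross_val_score(clf, X, y, cv=cv, scoring='accuracy').mean()
    print(f"max_depth={depth}: mean 5-fold accuracy {acc:.3f}")
    if acc > best_acc:
        best_depth, best_acc = depth, acc
print("best max_depth:", best_depth)
```

The same loop shape applies to any estimator with tunable hyper-parameters; only the estimator and grid change.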
### YOUR CODE HERE ###
# Align validation/test column names with the sanitized training names,
# and build the LightGBM datasets required by lgb.train below
X_val_final = X_val_final.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))
X_test_final = X_test_final.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))
train_data = lgb.Dataset(train_features, label=train_y)
valid_data = lgb.Dataset(valid_features, label=valid_y)
cv_auc_score = []
max_depth = [3, 5, 7, 10]
for i in max_depth:
    params = {'boosting_type': 'gbdt',
              'max_depth': i,
              'objective': 'binary',
              'nthread': 5,
              'num_leaves': 32,
              'learning_rate': 0.05,
              'max_bin': 512,
              'subsample_for_bin': 200,
              'subsample': 0.7,
              'subsample_freq': 1,
              'colsample_bytree': 0.8,
              'reg_alpha': 20,
              'reg_lambda': 20,
              'min_split_gain': 0.5,
              'min_child_weight': 1,
              'min_child_samples': 10,
              'scale_pos_weight': 1,
              'num_class': 1,
              'metric': 'auc'
              }
    lgbm = lgb.train(params,
                     train_data,
                     2500,
                     valid_sets=valid_data,
                     early_stopping_rounds=100,
                     verbose_eval=10
                     )
    y_pred_prob = lgbm.predict(X_val_final)
    cv_auc_score.append(roc_auc_score(y_val, y_pred_prob))
    print('For max_depth {0} (other parameters fixed), validation AUC score {1}'.format(i, roc_auc_score(y_val, y_pred_prob)))
print('The optimal max_depth: ', max_depth[np.argmax(cv_auc_score)])
Calculate the confusion matrix using the testing set.
### YOUR CODE HERE ###
params = {'boosting_type': 'gbdt',
'max_depth' : max_depth[np.argmax(cv_auc_score)],
'objective': 'binary',
'nthread': 5,
'num_leaves': 32,
'learning_rate': 0.05,
'max_bin': 512,
'subsample_for_bin': 200,
'subsample': 0.7,
'subsample_freq': 1,
'colsample_bytree': 0.8,
'reg_alpha': 20,
'reg_lambda': 20,
'min_split_gain': 0.5,
'min_child_weight': 1,
'min_child_samples': 10,
'scale_pos_weight': 1,
'num_class' : 1,
'metric' : 'auc'
}
lgbm = lgb.train(params,
train_data,
2500,
valid_sets=valid_data,
early_stopping_rounds= 100,
verbose_eval= 10
)
y_pred_prob = lgbm.predict(X_train_final)
print('For best max_depth {0}, The Train AUC score is {1}'.format(max_depth[np.argmax(cv_auc_score)],
roc_auc_score(y_train,y_pred_prob) ))
y_pred_prob = lgbm.predict(X_val_final)
print('For best max_depth {0}, The Cross validated AUC score is {1}'.format(max_depth[np.argmax(cv_auc_score)],
roc_auc_score(y_val,y_pred_prob) ))
y_pred_prob = lgbm.predict(X_test_final)
print('For best max_depth {0}, The Test AUC score is {1}'.format(max_depth[np.argmax(cv_auc_score)],
roc_auc_score(y_test,y_pred_prob) ))
# Threshold the predicted probabilities at 0.5 to get class labels
y_pred = (y_pred_prob > 0.5).astype(int)
print('The test AUC score is :', roc_auc_score(y_test,y_pred_prob))
print('The percentage of misclassified points {:05.2f}% :'.format((1-accuracy_score(y_test, y_pred))*100))
plot_confusion_matrix(y_test, y_pred)
Plot the ROC curve (receiver operating characteristic curve) showing the performance of the model, and calculate the AUC (area under the curve).
### YOUR CODE HERE ###
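A sketch of the requested ROC plot, mirroring the earlier logistic-regression version; the synthetic labels and scores below stand in for `y_test` and `lgbm.predict(X_test_final)`:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic labels/scores standing in for y_test and the model's probabilities
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
scores = np.clip(y_true * 0.3 + rng.random(1000) * 0.7, 0, 1)  # informative scores

fpr, tpr, _ = roc_curve(y_true, scores)
auc_score = roc_auc_score(y_true, scores)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, marker='.')
plt.plot([0, 1], [0, 1], linestyle='--')  # chance diagonal
plt.title('ROC curve (alternative model)', fontsize=20)
plt.xlabel('FPR', fontsize=15)
plt.ylabel('TPR', fontsize=15)
plt.grid()
plt.legend([f"AUC={auc_score:.3f}"])
plt.savefig("roc_alt_model.png", bbox_inches="tight")
```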
Use the test data to compare the logistic regression model and the tuned alternative model.
Choose a suitable plot and illustrate all the five performance measures (precision, recall, F_1-score, accuracy, and total misclassification rate) for the two models. Use appropriate figure size, title/caption, legend, and axis titles.
### YOUR CODE HERE ###
#bar plot
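A grouped bar chart works well here. The scores below are placeholders; the real values come from the test-set metrics computed for each model:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

measures = ['Precision', 'Recall', 'F1-score', 'Accuracy', 'Misclass. rate']
# Hypothetical scores; replace with the metrics computed on the testing set
logreg_scores = [0.20, 0.65, 0.31, 0.70, 0.30]
lgbm_scores   = [0.25, 0.60, 0.35, 0.78, 0.22]

x = np.arange(len(measures))
width = 0.35  # bar width; two bars per measure

fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(x - width / 2, logreg_scores, width, label='Logistic regression')
ax.bar(x + width / 2, lgbm_scores, width, label='LightGBM')
ax.set_xticks(x)
ax.set_xticklabels(measures)
ax.set_ylabel('Score')
ax.set_title('Test-set performance: logistic regression vs LightGBM')
ax.legend()
fig.savefig("model_comparison.png", bbox_inches="tight")
```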
Among the two models you have developed, what is your suggested model to Home Credit Group (client of the project), if they are more concerned about rejecting applications of individuals who may otherwise end up having difficulty in repayment? Provide a numeric measure for each of the two models to justify your response.
### YOUR EXPLANATION HERE ###
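If the client's priority is catching applicants who will have repayment difficulty, the measure to compare is recall on the default class (TARGET = 1): of all true defaulters, what share does each model flag? A sketch with hypothetical predictions — in the notebook, substitute `y_test` and each model's `y_pred`:

```python
import numpy as np
from sklearn.metrics import recall_score

# Toy labels; the hypothetical predictions stand in for each model's y_pred
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1])
pred_logreg = np.array([0, 1, 0, 0, 1, 0, 1, 1, 0, 1])  # hypothetical
pred_lgbm   = np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1])  # hypothetical

for name, pred in [("logistic regression", pred_logreg), ("LightGBM", pred_lgbm)]:
    r = recall_score(y_true, pred)  # share of true defaulters caught
    print(f"{name}: recall on defaulters = {r:.2f}")
```

Whichever model has the higher default-class recall rejects more of the truly risky applicants.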
Among the two models you have developed, what is your suggested model to Home Credit Group (client of the project), if they are more concerned about not rejecting applications of individuals who may be capable of repayment according to the data? Provide a numeric measure for each of the two models to justify your response.
### YOUR EXPLANATION HERE ###